Towards a General Model of Grapheme Frequencies for Slavic Languages
Authors: Peter Grzybek and Emmerich Kelih
Abstract
The present study discusses a possible theoretical model for grapheme frequencies of Slavic alphabets. Based on previous research on Slovene, Russian, and Slovak grapheme frequencies, the negative hypergeometric distribution is presented as a model that is adequate for various Slavic languages. Additionally, arguments are provided in favor of the assumption that the parameters of this model can be interpreted with recourse to inventory size.
1 Graphemes and Their Frequencies
The study of grapheme frequencies has long been a relevant object of research. From a historical perspective, only a small part of the studies along this line have been confined to the mere documentation of grapheme frequencies, considering this to be the immediate object and ultimate result of research.
Other approaches have considered the establishment of grapheme frequencies to be the basis for concrete applications. In fact, relevant studies in this direction have often been motivated or accompanied by an interest in rather practical issues such as the optimization of technical devices, the structure of codes and processes of information transfer, cryptographic matters, etc.
A third line of work on grapheme frequencies has been less practically and more theoretically oriented. In this framework, research has recently received increasing attention from quantitative linguistics. As compared to the studies outlined above, the focus of this renewed interest has shifted: in a properly designed quantitative study, counting letters (or graphemes), presenting the corresponding absolute (or relative) frequencies in tables, or illustrating the results obtained in figures is neither more nor less than one particular step. In this framework, data sampling is part of the empirical testing of a previously established hypothesis, motivated by linguistic research and translated into statistical terms.
The empirical testing thus provides the basis for a decision as to the initial hypothesis, and on the basis of the statistical interpretation of the results one can strive for their linguistic interpretation (cf. Altmann 1972, 1973). Providing and presenting data is thus part of scientific research, and it is a necessary pre-condition for theoretical models to be developed or elaborated.
As far as such a theoretical perspective is concerned, there are, from a historical perspective (for a history of studies on grapheme frequencies in Russian, which may serve as an example here, cf. Grzybek & Kelih 2003), two major directions in this field of research. Given the frequency of graphemes, based on a particular sample, one may predominantly be interested in
1. comparing the frequency of a particular grapheme with its frequency in another sample (or in other samples); the focus will thus be on the frequency analysis of individual graphemes;
2. comparing the frequencies of all graphemes in their mutual relationship, both for individual samples and across samples; the focus will thus be on the analysis and testing of an underlying frequency distribution model; this approach includes, if possible, the interpretation of the parameters of the model.
In our studies, we follow the second of these two courses. We are less interested in the frequency of individual graphemes. Rather, our general assumption is that the frequency with which graphemes occur in a given sample (text, corpus, etc.) is not accidental, but regulated by particular rules. More specifically, our hypothesis says that this rule, in the case of graphemes, works relatively independently of the specific data quality (i.e., with individual texts as well as with text segments, cumulations, mixtures, and corpora).
Translating this hypothesis into the language of statistics, we claim that the interrelation between the individual frequency classes is governed by a wider class of distributions characterized by the proportionality relation given in (1):
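To make the second of the two approaches more tangible, the following Python sketch ranks the letter frequencies of a sample and evaluates the negative hypergeometric distribution named in the abstract. It is an illustration under stated assumptions, not the authors' procedure: the parametrization with K, M and n, the mapping of rank r to class x = r - 1 with n set to the inventory size minus one, the treatment of every alphabetic character as one grapheme, and the parameter values K = 3.0 and M = 0.8 are all assumptions made here. The helper g shows what a proportionality relation between neighbouring classes, of the kind P_x = g(x) P_{x-1}, looks like for this particular distribution; it is derived from the probability function in the sketch and is not quoted from the source.

```python
from collections import Counter
from math import exp, lgamma


def ranked_relative_frequencies(text):
    """Relative letter frequencies of `text`, sorted in descending order.

    Assumption: every alphabetic character counts as one grapheme
    (digraphs are not merged), and case is ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def log_binom(a, b):
    """Logarithm of the generalized binomial coefficient C(a, b) for real a."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def neg_hypergeom_pmf(x, K, M, n):
    """Negative hypergeometric probability for x = 0, 1, ..., n.

    Assumed parametrization (K > M > 0, n a non-negative integer):
    P_x = C(M+x-1, x) * C(K-M+n-x-1, n-x) / C(K+n-1, n)
    """
    return exp(
        log_binom(M + x - 1, x)
        + log_binom(K - M + n - x - 1, n - x)
        - log_binom(K + n - 1, n)
    )


def g(x, K, M, n):
    """Ratio P_x / P_{x-1} of neighbouring classes, valid for x = 1, ..., n."""
    return (M + x - 1) * (n - x + 1) / (x * (K - M + n - x))


if __name__ == "__main__":
    sample = "V začetku je bila Beseda"  # arbitrary illustrative sample
    observed = ranked_relative_frequencies(sample)
    n = len(observed) - 1                # assumption: n = inventory size - 1
    K, M = 3.0, 0.8                      # hypothetical values, not fitted
    for rank, freq in enumerate(observed, start=1):
        model = neg_hypergeom_pmf(rank - 1, K, M, n)
        print(f"rank {rank:2d}  observed {freq:.4f}  model {model:.4f}")
```

In an actual study K and M would be estimated from the data and the fit would be tested statistically; the sketch only shows how the observed ranks and the model probabilities line up.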
Publication date: 2006